Monkeypox (mpox) is a re-emerging zoonotic disease characterized by skin lesions that resemble those of other dermatological disorders, which makes accurate and prompt diagnosis difficult. Automated medical image interpretation using deep learning has shown promise, but existing approaches often struggle with small and imbalanced datasets, which limits their accuracy. This study proposes a hybrid deep learning framework that integrates transformer-based global feature extraction with convolutional neural network (CNN)-based local feature learning for monkeypox classification. The proposed architecture combines a distilled Vision Transformer (DeiT) and EfficientNet-B4 through a learnable cross-attention gating mechanism, enabling adaptive fusion of global and local representations. To improve classification performance on imbalanced data, the model is trained with a combined loss function that integrates cross-entropy and focal loss, supplemented by dynamic class weighting. A three-phase training strategy with progressive unfreezing of network layers is adopted to stabilize training and improve generalization. Experiments are conducted on publicly available monkeypox skin lesion datasets using stratified k-fold cross-validation, and the model is evaluated using accuracy, F1-score, and other classification metrics. The results demonstrate the effectiveness of hybrid transformer-CNN architectures with adaptive feature fusion for monkeypox image classification and present a candidate model for real-world diagnostic applications, while also addressing problems related to dataset quality and generalization.
Introduction
This study addresses the diagnosis of visually similar skin conditions, specifically monkeypox, chickenpox, and measles, alongside normal skin, using a hybrid deep learning model.
These diseases are difficult to distinguish because their skin symptoms are similar, and traditional methods such as PCR testing are slow and impractical in low-resource settings. Existing AI approaches also have limitations: CNNs capture fine local details but miss overall patterns, while transformer models capture global context but require large datasets.
To solve this, the study introduces MpoxNetV, a hybrid model combining a Vision Transformer (DeiT) for global feature analysis and EfficientNet-B4 for detailed local feature extraction. A cross-attention gating mechanism dynamically balances these two inputs for each image. The model is trained using a three-phase strategy (freezing, partial unfreezing, and full fine-tuning) along with a combined loss function to handle class imbalance.
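As a rough illustration of the two ideas in the paragraph above, the following minimal Python sketch shows a per-image sigmoid gate that blends a global and a local feature vector, and a class-weighted mix of cross-entropy and focal loss. It is a simplification and not the MpoxNetV implementation: the function names, the single scalar gate (standing in for the cross-attention gate described above), the inverse-frequency weighting rule, and the values of `lam` and `gamma` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_global, f_local, w, b):
    """Blend two same-length feature vectors with one scalar gate per image.

    Assumption: both branches were already projected to the same dimension.
    w has length len(f_global) + len(f_local); in a real model w and b
    would be learned parameters.
    """
    concat = f_global + f_local
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    fused = [g * a + (1.0 - g) * c for a, c in zip(f_global, f_local)]
    return fused, g


def class_weights(counts):
    """Inverse-frequency weights (assumed rule): rarer classes weigh more."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * n) for n in counts]


def combined_loss(logits, target, weights, lam=0.5, gamma=2.0):
    """Class-weighted mix of cross-entropy and focal loss for one sample.

    lam (mixing factor) and gamma (focusing parameter) are illustrative
    values, not ones reported in this paper.
    """
    p_t = softmax(logits)[target]
    ce = -math.log(p_t)
    focal = -((1.0 - p_t) ** gamma) * math.log(p_t)
    return weights[target] * (lam * ce + (1.0 - lam) * focal)


# With a zero weight vector the gate is exactly 0.5, so the result is the
# element-wise average of the two branches.
fused, gate = gated_fusion([1.0, 2.0], [3.0, 4.0], w=[0.0] * 4, b=0.0)

# Hypothetical class counts for a two-class imbalanced set.
weights = class_weights([10, 30])
loss = combined_loss([0.0, 0.0], target=0, weights=weights)
```

A gate near 1 favours the global (transformer) branch and a gate near 0 favours the local (CNN) branch. The focal term down-weights samples the model already classifies confidently, while the class weights raise the contribution of under-represented classes.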
The system is evaluated using stratified 5-fold cross-validation, achieving a mean validation accuracy of about 93.5% (93.51% ± 1.14% across folds). The results indicate that combining global and local features with the adaptive training strategy improves classification accuracy and robustness compared with CNN-only and earlier hybrid approaches.
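The evaluation protocol above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `stratified_kfold` and `summarize` are hypothetical helper names, the class counts in the usage example are invented, and the sample standard deviation is used for the spread because the paper does not state which variant was reported.

```python
import random
import statistics
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class mix mirrors the full set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, val


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Hypothetical, imbalanced four-class label list.
labels = ["mpox"] * 10 + ["chickenpox"] * 5 + ["measles"] * 5 + ["normal"] * 20
for train_idx, val_idx in stratified_kfold(labels, k=5):
    # A model would be trained on train_idx and scored on val_idx here.
    pass
```

Because each class is dealt across folds separately, every validation fold keeps roughly the same class proportions as the full dataset, which matters when some classes have only a few images. When class sizes are not multiples of k, this simple dealing scheme leaves the earlier folds slightly larger.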
Conclusion
This paper presented MpoxNetV, a hybrid deep learning framework that uses an adaptive cross-attention gating mechanism to integrate the global feature extraction of DeiT with the local detail capture of EfficientNet-B4. The model was trained on imbalanced monkeypox skin lesion datasets using a three-phase unfreezing strategy and a combined cross-entropy and focal loss. Under 5-fold stratified cross-validation, it reached a mean validation accuracy of 93.51% ± 1.14%, outperforming previous CNN-only and hybrid approaches such as Uysal’s 87% [1] on the same four-class task (monkeypox, chickenpox, measles, normal skin). Key contributions include a “fusion gate” that dynamically weights the two branches per image, handling tailored to small or imbalanced medical datasets, and a rigorous evaluation protocol that directly addresses the data quality concerns flagged by Vega et al. [3]. These findings support the use of hybrid transformer-CNN models for visually complex skin conditions and point to a practical, scalable path toward automated diagnosis in resource-limited settings, where PCR test delays could be dangerous. With further development, MpoxNetV could become a production-ready tool that helps bridge the gap between research and global health equity.
MpoxNetV presents a promising foundation for automated skin lesion classification; however, several directions can further enhance its effectiveness and real-world applicability:
1) Dataset Expansion and Multimodal Learning: Future work will focus on incorporating larger, clinically validated datasets encompassing diverse skin tones, lesion stages, and comorbid conditions. Additionally, integrating patient metadata (e.g., symptoms, fever history) using multimodal transformer architectures can improve diagnostic accuracy.
2) Edge Deployment and Model Optimization: To enable real-time inference in resource-constrained environments, lightweight variants (e.g., EfficientNet-Lite) can be explored. Model compression and optimization techniques will facilitate deployment on mobile devices and field diagnostic tools.
3) Few-Shot and Domain Adaptation: Investigating zero-shot and few-shot learning approaches, such as prompt tuning, can improve generalization to unseen disease variants. Domain adaptation techniques can further improve robustness across different datasets, imaging conditions, and clinical settings.
4) Clinical Validation Studies: Conducting real-world clinical trials comparing MpoxNetV with expert dermatologists will provide insights into diagnostic speed, reliability, and practical utility, particularly in low-resource healthcare environments.
5) Explainability and Interpretability: Incorporating explainability techniques such as Grad-CAM++ and attention map visualization within the cross-attention framework can enhance transparency, enabling clinicians to better understand and trust model predictions.
References
[1] F. Uysal, “Detection of monkeypox disease from human skin images with a hybrid deep learning model,” Diagnostics, vol. 13, no. 10, p. 1772, 2023. https://doi.org/10.3390/diagnostics13101772
[2] A. Jaradat et al., “Automated monkeypox skin lesion detection using deep learning models,” Diagnostics, vol. 13, no. 5, p. 954, 2023. https://doi.org/10.3390/diagnostics13050954
[3] J. A. Vega, C. Granados, and P. Fontelo, “Analysis: Flawed datasets of monkeypox skin images,” Diagnosis, vol. 10, no. 2, pp. 61–63, 2023. https://doi.org/10.1515/dx-2022-0099
[4] K. S. Q. Al-Hammuri et al., “Vision transformer architecture and applications in digital health: Tutorial and survey,” Visual Computing for Industry, Biomedicine, and Art, vol. 6, p. 22, 2023. https://doi.org/10.1186/s42492-023-00140-9
[5] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2021. https://arxiv.org/abs/2010.11929
[6] H. Touvron et al., “Training data-efficient image transformers & distillation through attention,” in Proc. ICML, 2021.
[7] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. ICML, 2019.
[8] K. Ali et al., “Monkeypox skin lesion detection using deep learning models: A feasibility study,” arXiv:2207.03342, 2022. https://arxiv.org/abs/2207.03342
[9] M. Dwivedi, R. G. Tiwari, and N. Ujjwal, “Deep learning methods for early detection of monkeypox skin lesion,” in Proc. 2022 IEEE ICSC, Noida, India, 2022, pp. 343–348. https://doi.org/10.1109/ICSC56524.2022.10009571
[10] C. Sitaula et al., “Detection of monkeypox from skin lesion images using deep learning networks and explainable AI,” J. Integrative Bioinformatics, vol. 20, no. 4, p. 20230025, 2023.
[11] A. K. Singh, B. Kadhiwala, and R. Patel, “Hand-written Hindi character recognition - A comprehensive review,” in Proc. 2021 2nd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 2021, pp. 1–5. https://doi.org/10.1109/GCAT52182.2021.9587554
[12] U. Chakraborty, D. Thilagavathy, S. K. Sharma, and A. K. Singh, “Hybrid deep learning with Alexnet feature extraction and Unet classification for early detection in leaf diseases,” ICTACT Journal on Soft Computing, vol. 14, no. 3, pp. 3255–3262, 2024.
[13] M. Vyas, A. K. Singh, and N. Parmar, “Analyzing language in multilingual speech using deep neural network.”
[14] N. N. Soe et al., “Using AI to differentiate mpox from common skin lesions in sexual health clinics,” JMIR Dermatology, vol. 7, p. e54321, 2024.
[15] A. A. Mohammed et al., “Deep learning based detection of monkeypox virus using skin lesion images,” Computers in Biology and Medicine, vol. 159, p. 106944, 2023.
[16] M. Saha et al., “Deep learning-based mpox skin lesion detection and classification,” Diagnostics, vol. 15, no. 19, p. 2487, 2025.
[17] Y. Cao et al., “Robustly detecting mpox and non-mpox using a deep generative model,” Scientific Reports, vol. 15, p. 85771, 2025.
[18] M. M. Ahsan et al., “Human monkeypox classification from skin lesion images using deep learning,” Diagnostics, vol. 12, no. 10, p. 2438, 2022.
[19] H. Barzekar et al., “Monkeypox disease detection using deep learning: Systematic literature review,” J. Integrative Bioinformatics, vol. 20, no. 4, p. 20230028, 2023.